# Load the needed packages
library(ggplot2)
library(dplyr)
library(lubridate)
library(survival)
library(ggsurvfit)
library(gtsummary)
library(here)
library(survminer)
library(broom)
library(forestploter)
library(tidyr)Fixed Time Survival Analysis
# Load example data
df <- colonThis analysis focus on survival following the chemotherapy treatment for colon cancer.
About the sample data
The data come from the colon dataset, available from the survival package. These data include information from a clinical trial on the effectiveness of two different types of chemotherapy (levamisole and levamisole+5-fluorouracil) compared to controls (i.e. no chemotherapy treatment) on survival from stage B/C colon cancer.
There are two rows per person in the dataset, one for cancer recurrence and one for death, indicated by the event type (etype) variable (etype==1 corresponds to recurrence and etype==2 to death). In analysis below, I only focus on analysing death as an outcome.
Note: there is some incomplete values on the differ variable, for simplicity, in the below analysis, I drop those incomplete values.
Some important variables:
rx: Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
adhere: adherence to nearby organs
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type: 1=recurrence,2=death
Data cleaning
- Filter records with death outcome
- Drop incomplete values on the
diffvariable - Label the
diffandextentvariables - Stratify the
agevariable
df1 <- df %>%
filter(etype == 2) %>% # Filter to deaths
filter(!is.na(differ)) %>%
mutate(
differF = factor(differ, levels = 1:3, labels = c("well","moderate","poor")),
extentF = factor(extent, levels = 1:4, labels = c("submucosa","muscle","serosa","contiguous")),
ageF = factor(ifelse(age<70, 1, 2), levels = 1:2, labels = c('18-69', '70+')),
sexF = factor(sex, levels = 0:1, labels = c("Female","Male")),
surgF = factor(surg, levels = 0:1, labels = c("short", "long"))
)EDA
Table 1: Summary of demographics and disease status by treatment group
| Obs | Lev | Lev+5FU | Total | |
|---|---|---|---|---|
| Total (column denominator) | 308 (100%) | 300 (100%) | 298 (100%) | 906 (100%) |
| Age | ||||
| Mean, (SD) | 59, (12) | 60, (12) | 60, (12) | 60, (12) |
| Median, (IQR) | 61, (53, 68) | 61, (53, 69) | 62, (52, 70) | 61, (53, 69) |
| Range | 18, 85 | 27, 83 | 26, 81 | 18, 85 |
| Sex | ||||
| Female | 146 (47%) | 131 (44%) | 161 (54%) | 438 (48%) |
| Male | 162 (53%) | 169 (56%) | 137 (46%) | 468 (52%) |
| Obstruction of colon | 62 (20%) | 59 (20%) | 54 (18%) | 175 (19%) |
| Perforation of colon | 9 (3%) | 10 (3%) | 8 (3%) | 27 (3%) |
| Adherence to nearby organs | 45 (15%) | 47 (16%) | 39 (13%) | 131 (14%) |
| Differentiation of tumour | ||||
| well | 27 (9%) | 37 (12%) | 29 (10%) | 93 (10%) |
| moderate | 229 (74%) | 219 (73%) | 215 (72%) | 663 (73%) |
| poor | 52 (17%) | 44 (15%) | 54 (18%) | 150 (17%) |
| Extent of local spread | ||||
| submucosa | 7 (2%) | 3 (1%) | 10 (3%) | 20 (2%) |
| muscle | 36 (12%) | 35 (12%) | 31 (10%) | 102 (11%) |
| serosa | 248 (81%) | 251 (84%) | 246 (83%) | 745 (82%) |
| contiguous | 17 (6%) | 11 (4%) | 11 (4%) | 39 (4%) |
| Time until registration | ||||
| short | 218 (71%) | 222 (74%) | 222 (74%) | 662 (73%) |
| long | 90 (29%) | 78 (26%) | 76 (26%) | 244 (27%) |
| >4 positive lymph nodes | 87 (28%) | 87 (29%) | 78 (26%) | 252 (28%) |
| Days until death/censored | ||||
| Mean, (SD) | 1,599, (857) | 1,615, (894) | 1,796, (866) | 1,669, (876) |
| Median, (IQR) | 1,854, (759, 2,274) | 1,910, (741, 2,383) | 2,092, (977, 2,472) | 1,978, (802, 2,375) |
| Range | 113, 3,214 | 24, 3,329 | 23, 3,309 | 23, 3,329 |
| Death | 165 (54%) | 154 (51%) | 122 (41%) | 441 (49%) |
Figure 1: Kaplan-Meier suvival plots for key predictors
Key Findings::
Survival following treatment for colon cancer was not differentiated by age (p=0.58), sex (p=0.49) or perforation of colon (p=0.56).
However, survival outcomes did differ across the categories of the remaining variables, with better survival rates associated with the Lev+5FU treatment, unobstructed colon, no adherence to nearby organs, well or moderately differentiated tumour, local spread limited to the submucosa or muscle, shorter time until registration and fewer positive lymph nodes.
Table 2. Hazard ratios and 95% Confidence Intervals for univariable and multivariable Cox regression models
| Characteristic |
Univariable
|
Multivariable
|
||||
|---|---|---|---|---|---|---|
| HR1 | 95% CI1 | p-value | HR1 | 95% CI1 | p-value | |
| Treatment | ||||||
| Obs | — | — | — | — | ||
| Lev | 0.96 | 0.77, 1.19 | 0.7 | 0.98 | 0.79, 1.23 | 0.9 |
| Lev+5FU | 0.70 | 0.55, 0.88 | 0.002 | 0.70 | 0.55, 0.88 | 0.003 |
| Sex | 1.00 | 0.83, 1.21 | >0.9 | |||
| Age (70+ years) | 1.08 | 0.87, 1.34 | 0.5 | |||
| Obstruction of colon | 1.30 | 1.03, 1.63 | 0.025 | 1.29 | 1.03, 1.63 | 0.028 |
| Perforation of colon | 1.17 | 0.70, 1.95 | 0.6 | |||
| Adherence to nearby organs | 1.35 | 1.05, 1.73 | 0.018 | 1.19 | 0.92, 1.53 | 0.2 |
| Differentiation of tumour | ||||||
| well | — | — | — | — | ||
| moderate | 1.05 | 0.76, 1.45 | 0.8 | 0.93 | 0.67, 1.29 | 0.7 |
| poor | 1.70 | 1.18, 2.46 | 0.005 | 1.36 | 0.93, 1.97 | 0.11 |
| Extent of local spread | ||||||
| submucosa | — | — | — | — | ||
| muscle | 1.83 | 0.65, 5.15 | 0.3 | 1.34 | 0.47, 3.79 | 0.6 |
| serosa | 3.25 | 1.21, 8.71 | 0.019 | 2.18 | 0.81, 5.87 | 0.12 |
| contiguous | 5.04 | 1.75, 14.5 | 0.003 | 3.07 | 1.06, 8.94 | 0.039 |
| Time until registration | 1.26 | 1.03, 1.54 | 0.027 | 1.27 | 1.03, 1.56 | 0.022 |
| >4 positive lymph nodes | 2.58 | 2.13, 3.12 | <0.001 | 2.49 | 2.05, 3.02 | <0.001 |
| 1 HR = Hazard Ratio, CI = Confidence Interval | ||||||
Key Findings:
The estimates confirmed that although treatment with levamisole did not improve outcomes compared to the control group (HR = 0.98, 95% CI = 0.79-1.23), the hazard of death was 30% lower among patients treated with levamisole+5-fluorouracil (HR = 0.70, 95% CI = 0.55-0.88). Other factors significantly associated with increased hazard of death included obstruction of the colon (HR = 1.29, 95% CI = 1.03-1.63), local spread to contiguous regions (HR = 3.07, 95% CI = 1.06-8.94), longer time between surgery and registration (HR = 1.27, 95% CI = 1.03-1.56) and more than 4 positive lymph nodes (HR = 2.49, 95% CI = 2.05-3.02).
Figure 2: Forest plot for hazard ratios and 95% confidence intervals for multivariable Cox regression models
Assessing the proportional harzard assumptions
Using a chi-squared test based on Schoenfeld residuals:
H0: Covariate effect is constant (proportional) over time HA: Covariate effect changes over time
The null hypothesis of proportional hazard is tested for each covariate individually and jointly as well.
If p is <0.05 then there is evidence for violation of the proportional hazards assumption.
cox.zph(cox_model)#> chisq df p
#> rx 2.3509 2 0.30869
#> obstruct 6.2760 1 0.01224
#> adhere 0.0775 1 0.78074
#> differF 16.0442 2 0.00033
#> extentF 7.3605 3 0.06125
#> surg 0.0247 1 0.87521
#> node4 5.8260 1 0.01579
#> GLOBAL 36.5773 11 0.00014
The test confirms that the proportional hazards assumption is violated for obstruction of colon (p=0.01), differentiation of tumour (p < 0.001) and marginally for extent of local spread (p=0.06). The test also suggests that the variable indicating more than 4 positive lymph nodes also violates the assumption (p=0.016); the global test also indicates the assumption is invalid (p<0.001).
A plot of a smoothed curve over the Schoenfeld residuals
Note: It is actually plotting the coefficient for each predictor at each time point over time). We want to see a flat line over time.
(Side note: If we have a large data, we will be able to detect very small changes of coefficients over time. So if the change in the coefficient is not large enough to be clinically meaningfully, it can perhaps be ignored as well).
plot(cox.zph(cox_model))Dealing with proportional hazards violation as a sensitivity Analysis
Startify by the non-PH variable
In the stratified Cox model: - The cox model is estimated separately in each stratum - Drawback: we cannot quantify the effect of the stratification variable on survival (i.e., no coefficient will be estimated).
Because these variables are not primary factors of interest we can control for them using stratification. The resulting estimated hazard ratios and 95% confidence intervals are presented in Table 3. As can be seen, the proportional hazards assumption is met in this model.
mvModelStratified <- coxph(Surv(time, status) ~ rx + strata(obstruct) + adhere + strata(differF) + strata(extentF) + surg + strata(node4), data = df1)
cox.zph(mvModelStratified)#> chisq df p
#> rx 2.24767 2 0.33
#> adhere 1.50847 1 0.22
#> surg 0.00211 1 0.96
#> GLOBAL 3.87358 4 0.42